Intro to R: A hands-on tutorial

Day 0: Intro to statistical programming

Olivia Fiol, Ajjit Narayanan, Amy Rogin, Fay Walker, Aaron R. Williams

Statistical Programming

Motivation: why statistical programming?

  1. Clearly answer questions
  2. Clearly communicate the answer to questions
  3. Document the steps to answering questions

Example 1

What is 2 + 2?

2 + 2
## [1] 4

Example 2

What is the median price of diamonds with carat > 1 and a “Good” cut?

library(tidyverse)

diamonds %>%
  filter(carat > 1, cut == "Good") %>%
  summarize(median(price))
## # A tibble: 1 x 1
##   `median(price)`
##             <int>
## 1            6412

Example 3

How could increasing the retirement age affect the poverty rates of Hispanic women ages 62 and older?

Via die-seite-des-dr-caligari

Do cool stuff

R Shiny

Fact sheets

urbnmapr

urbnthemes

R packages

Six principles

1) Accuracy

Deliberate steps should be taken to minimize the chance of making an error and maximize the chance of catching errors when errors inevitably occur.

2) Computational reproducibility

Computational reproducibility should be embraced to improve accuracy, promote transparency, and prove the quality of analytic work.

Computational reproducibility

  • Replication, the recreation of findings across repeated studies, is a cornerstone of science

  • Reproducibility: the ability to access data, source code, tools, and documentation and recreate all calculations, visualizations, and artifacts of an analysis

  • Computational reproducibility should be the minimum standard for computational social sciences and statistical programming

3) Human interpretability

Code should be written so humans can easily understand what’s happening—even if it occasionally sacrifices machine performance.

4) Portability

Analyses should be designed so strangers can understand each and every step without additional instruction or inquiry from the original analyst.

5) Accessibility

Research and data are non-rivalrous and can be non-excludable. They are public goods that should be widely and easily shared. Decisions about tools, methods, data, and language during the research process should be made in ways that promote the ability of anyone and everyone to access an analysis.

6) Efficiency

Analysts should seek to make all parts of the research process more efficient by communicating clearly, adopting best practices, and managing computation.

Principles

  1. Accuracy
  2. Computational reproducibility
  3. Human interpretability
  4. Portability
  5. Accessibility
  6. Efficiency

Fundamental concepts

Text editor/IDE

  • R <- free, open source programming language
  • RStudio <- for-profit company and Integrated Development Environment (IDE)

RStudio

The R console

Script

  • A plain text document that contains code and comments
  • Map to the answer
  • .R and .Rmd

Comments

# fivethirtyeight contains bad_drivers
library(fivethirtyeight)
# dplyr provides mutate()
library(dplyr)

# increase perc_speeding because of systematic underreporting
mutate(bad_drivers, perc_speeding = perc_speeding * 1.2)

  • Clear code avoids the need for describing “what”
  • Comments should focus on “why”

Coding style

“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” ~ Hadley Wickham

  • CamelCase
  • camelCase
  • snake_case

tidyverse style guide

R Packages

Collections of R, C, C++, and FORTRAN code that expand the functionality of R.

Comprehensive R Archive Network

  • CRAN was introduced in 1997.
  • Repository of popular R packages with basic standards and quality control.

tidyverse

Comprehensive set of tools for data science

Core: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats

tidyverse

R for Data Science (R4DS): a free online book by Hadley Wickham and Garrett Grolemund

Installing and loading packages

# run only once ever(ish) and don't include in scripts
install.packages("tidyverse")
# include at the top of scripts and run once per session
library(tidyverse)

Data structures

Scalars (do not exist in R)

Vectors

## [1] 1 2 3 4 5
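The output above could come from code like the following. This is a minimal sketch; the object names `x` and `y` are hypothetical. It also shows why R has no scalars: a single value is a vector of length one.

```r
# a vector holding the integers 1 through 5
x <- 1:5
x

# a single value is still a vector; it just has length one
y <- 2
length(y)
is.vector(y)
```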

Matrices

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
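One way to build a matrix like the one above is with `matrix()`, which fills column by column by default. The object name `m` is hypothetical.

```r
# fill a 2-row matrix with 1 through 6, one column at a time
m <- matrix(1:6, nrow = 2)
m

# index with [row, column]
m[1, 2]
dim(m)
```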

Data frames, multidimensional arrays

## # A tibble: 4 x 4
##   name                       awake  brainwt bodywt
##   <chr>                      <dbl>    <dbl>  <dbl>
## 1 Cheetah                     11.9 NA       50    
## 2 Owl monkey                   7    0.0155   0.48 
## 3 Mountain beaver              9.6 NA        1.35 
## 4 Greater short-tailed shrew   9.1  0.00029  0.019
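A data frame stores one observation per row and one variable per column, and each column is a vector. As a rough sketch, the values above can be typed into a base R `data.frame()`. This is an assumption for illustration: the slide shows a tibble, and the object name `animals` is hypothetical.

```r
# each argument becomes a column; all columns must have the same length
animals <- data.frame(
  name = c("Cheetah", "Owl monkey", "Mountain beaver",
           "Greater short-tailed shrew"),
  awake = c(11.9, 7, 9.6, 9.1),
  brainwt = c(NA, 0.0155, NA, 0.00029),
  bodywt = c(50, 0.48, 1.35, 0.019)
)

# inspect the structure, then pull out one column as a vector
str(animals)
animals$awake
```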

Data types

Character

## [1] "a" "b" "c" "d" "e"

Numeric

## [1] 1 2 3 4 5

Logical

## [1]  TRUE  TRUE FALSE  TRUE FALSE

Factor

## [1] good ok   bad  ok   ok  
## Levels: good ok bad
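The four outputs above could be created as follows. This is a sketch with hypothetical object names. `class()` reports the type of each vector.

```r
# character: text wrapped in quotes
chr <- c("a", "b", "c", "d", "e")
# numeric: numbers
num <- c(1, 2, 3, 4, 5)
# logical: TRUE or FALSE
lgl <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
# factor: categories with a fixed set of levels in a chosen order
fct <- factor(c("good", "ok", "bad", "ok", "ok"),
              levels = c("good", "ok", "bad"))

class(chr)
class(num)
class(lgl)
class(fct)
levels(fct)
```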

Missing values

  • NA is R’s encoding for missing values
  • Missing values are contagious
mean(c(1, 2, 3, 4, NA))
## [1] NA
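A short sketch of how to find missing values and work around them. The object name `x` is hypothetical.

```r
x <- c(1, 2, 3, 4, NA)

# any arithmetic that touches NA returns NA
1 + NA

# find and count the missing values
is.na(x)
sum(is.na(x))

# many summary functions can drop missing values with na.rm = TRUE
mean(x, na.rm = TRUE)
```

Dropping missing values is a decision, not a default: check why values are missing before setting `na.rm = TRUE`.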

Assignment

R can hold many different objects at the same time. Storing the result of code requires assignment (<-).

a <- 2
b <- 2

a + b
## [1] 4
c <- a + b
c
## [1] 4

Functions

Arguments by position

mean(c(1, 2, 3, 4, NA), 0.2, TRUE)
## [1] 2.5

Arguments by name

mean(x = c(1, 2, 3, 4, NA), trim = 0.2, na.rm = TRUE)
## [1] 2.5

Function documentation

?mean

Custom functions

Rule of three: never copy and paste the same code three or more times. Write a function instead.

test_oddness <- function(x) {
  ifelse(test = x %% 2 == 0, yes = "even!", no = "odd!")
}

test_oddness(1:10)
##  [1] "odd!"  "even!" "odd!"  "even!" "odd!"  "even!" "odd!"  "even!" "odd!" 
## [10] "even!"

Tests

What will it take to convince you that your code is correct?

  1. Assign monthly observations to fiscal years
    • Are there 12 months per year?
  2. Link observations from 2017 to observations from 2018
    • Do non-matching variables that shouldn’t change change?
  3. Tax calculator
    • Are values that must be positive non-positive?
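Checks like these can be written directly into a script with `stopifnot()`, which stops with an error if any condition is not TRUE. The sketch below uses made-up data, and it assumes a fiscal year that starts in October and is named for the year in which it ends. All object names are hypothetical.

```r
# hypothetical data: 24 monthly observations starting in October 2016
months <- seq(as.Date("2016-10-01"), by = "month", length.out = 24)

# assumption: October through December belong to the next fiscal year
calendar_year <- as.integer(format(months, "%Y"))
calendar_month <- as.integer(format(months, "%m"))
fiscal_year <- ifelse(calendar_month >= 10, calendar_year + 1L, calendar_year)

# test: every fiscal year should contain exactly 12 months
months_per_year <- table(fiscal_year)
stopifnot(all(months_per_year == 12))

# test: values that must be positive should never be non-positive
tax_owed <- c(1200, 350, 80)
stopifnot(all(tax_owed > 0))
```

If a test fails, the script halts at that line, so errors are caught before they reach the results.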

Testing tips

  1. Write the test first!
  2. Each time you encounter a bug, write a test that will convince you the bug no longer exists.

Organizing an analysis

1) Keep things together

  • If possible, store data, scripts, and outputs in the same place.
  • Sort data, scripts, and outputs into subdirectories with names like data/, scripts/, and outputs/

2) File paths

  • File paths are programmatic references to the locations of files on a computer.
  • R accepts / in file paths regardless of operating system.
  • Example: C:/Users/awilliams/Documents/presentations/urbn101-intro-r/lessons

3) Working directories

  • Code needs to be portable!
    • Use relative file paths
  • Programmers can use setwd() to shorten absolute file paths
  • .Rproj files (RStudio projects) are a superior solution only available in R
    • Never use setwd() in R
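A relative file path starts from the project folder instead of from the root of one person's computer. A minimal sketch, assuming a hypothetical file named `survey.csv` stored in a `data/` subdirectory:

```r
# build a path relative to the project folder
data_path <- file.path("data", "survey.csv")
data_path

# with an .Rproj open, the working directory is the project folder,
# so the same relative path works on any computer
getwd()
```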

Ways to learn a programming language

Tips

  1. Read R4DS
  2. Attend the rest of this training
  3. Find a project ASAP
  4. Connect with the community
  5. Use R, use it again, and then use R some more

Schedule

Software check

Check R

  1. Open RStudio
  2. Submit sessionInfo()
  3. Is R Version > 3.6.0?
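The version check can also be done in code. This is a sketch; the object name `is_new_enough` is hypothetical.

```r
# print the installed version of R
R.version.string

# compare the installed version with 3.6.0
is_new_enough <- getRversion() > "3.6.0"
is_new_enough
```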

Check RStudio

  1. In RStudio, click Help > About RStudio
  2. Is the version > 1.0.0?

Install the tidyverse

  1. Submit install.packages("tidyverse") to the console
  2. Run library(tidyverse)

A survey of other programming languages

Stata

  • Common users: economists, Nate Silver
  • Strengths: out-of-the-box econometric tools, simple syntax
  • Limitations: proprietary, one data set at a time, inflexible

Photo by StataCorp LP, CC BY-SA 4.0, Unaltered

SAS

  • Common users: veteran researchers, government
  • Strengths: processes data on disk, so data sets need not fit in memory
  • Limitations: proprietary, expensive, clunky, inflexible, lacks environments, documentation

Python

  • Common users: data scientists, computer scientists
  • Strengths: general purpose programming, extensibility, flexibility
  • Limitations: steep learning curve

R

  • Common users: statisticians, data scientists, biostatisticians
  • Strengths: extensible, documentation, community, many objects at once
  • Limitations: multiple languages in one

Others

  • SPSS
  • Matlab
  • Julia
  • Rust
  • JavaScript
  • SQL

What you use matters less than how you use it

R is the best

Comparison

Source is unknown